etcd 时钟不同步问题

现象:集群版大脑3台etcd启动正常,etcdclient起不来,会报超时错误

1
2
worker called
{"level":"warn","ts":"2020-12-31T01:49:48.985+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-b0d5829c-b338-4f46-a7a4-637742899c26/etcd2:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

etcdclient启动时会连接etcd,连不上就会报如上错误。

在其余两台机器上通过如下命令连接访问3个etcd,发现etcd2链接不上

1
etcdctl  get "/"  --prefix=true --endpoints=etcd2:2379

为何etcd2会链接不上嘞

查看etcd2的日志,发现etcd正常启动后一直有timeout超时的现象,数据没同步,但是一直在运行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
2020-12-31 10:41:00.322098 I | etcdserver: starting server... [version: 3.3.8, cluster version: 3.3]
2020-12-31 10:41:00.326133 I | rafthttp: peer c444e9acc71cc0dd became active
2020-12-31 10:41:00.326211 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream Message writer)
2020-12-31 10:41:00.326276 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream MsgApp v2 writer)
2020-12-31 10:41:00.785946 I | rafthttp: peer e9a110a6ff0ea28f became active
2020-12-31 10:41:00.786016 I | rafthttp: established a TCP streaming connection with peer e9a110a6ff0ea28f (stream MsgApp v2 writer)
2020-12-31 10:41:00.786040 I | rafthttp: established a TCP streaming connection with peer e9a110a6ff0ea28f (stream Message writer)
2020-12-31 10:41:00.824821 I | etcdserver: 1631c938b2ee394f initialzed peer connection; fast-forwarding 18 ticks (election ticks 20) with 2 active peer(s)
2020-12-31 10:41:00.830611 I | rafthttp: established a TCP streaming connection with peer e9a110a6ff0ea28f (stream Message reader)
2020-12-31 10:41:00.830664 I | rafthttp: established a TCP streaming connection with peer e9a110a6ff0ea28f (stream MsgApp v2 reader)
2020-12-31 10:41:01.990208 W | rafthttp: closed an existing TCP streaming connection with peer c444e9acc71cc0dd (stream Message writer)
2020-12-31 10:41:01.990254 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream Message writer)
2020-12-31 10:41:01.990367 W | rafthttp: closed an existing TCP streaming connection with peer c444e9acc71cc0dd (stream MsgApp v2 writer)
2020-12-31 10:41:01.990398 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream MsgApp v2 writer)
2020-12-31 10:41:03.327048 E | rafthttp: failed to dial c444e9acc71cc0dd on stream MsgApp v2 (dial tcp 10.0.1.248:2380: getsockopt: connection refused)
2020-12-31 10:41:03.327090 I | rafthttp: peer c444e9acc71cc0dd became inactive
2020-12-31 10:41:03.330176 I | rafthttp: peer c444e9acc71cc0dd became active
2020-12-31 10:41:03.330205 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream MsgApp v2 reader)
2020-12-31 10:41:03.330956 E | rafthttp: failed to dial c444e9acc71cc0dd on stream Message (dial tcp 10.0.1.248:2380: getsockopt: connection refused)
2020-12-31 10:41:03.330983 I | rafthttp: peer c444e9acc71cc0dd became inactive
2020-12-31 10:41:03.333257 I | rafthttp: peer c444e9acc71cc0dd became active
2020-12-31 10:41:03.333284 I | rafthttp: established a TCP streaming connection with peer c444e9acc71cc0dd (stream Message reader)
2020-12-31 10:41:04.284317 I | raft: 1631c938b2ee394f is starting a new election at term 3529
2020-12-31 10:41:04.284402 I | raft: 1631c938b2ee394f became candidate at term 3530
2020-12-31 10:41:04.284440 I | raft: 1631c938b2ee394f received MsgVoteResp from 1631c938b2ee394f at term 3530
2020-12-31 10:41:04.284458 I | raft: 1631c938b2ee394f [logterm: 2675, index: 794420] sent MsgVote request to c444e9acc71cc0dd at term 3530
2020-12-31 10:41:04.284471 I | raft: 1631c938b2ee394f [logterm: 2675, index: 794420] sent MsgVote request to e9a110a6ff0ea28f at term 3530
2020-12-31 10:41:04.287340 I | raft: 1631c938b2ee394f received MsgVoteResp rejection from e9a110a6ff0ea28f at term 3530
2020-12-31 10:41:04.287389 I | raft: 1631c938b2ee394f [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections
2020-12-31 10:41:12.932287 I | raft: 1631c938b2ee394f [term: 3530] received a MsgVote message with higher term from c444e9acc71cc0dd [term: 3533]
2020-12-31 10:41:12.932349 I | raft: 1631c938b2ee394f became follower at term 3533
2020-12-31 10:41:12.932371 I | raft: 1631c938b2ee394f [logterm: 2675, index: 794420, vote: 0] cast MsgVote for c444e9acc71cc0dd [logterm: 3529, index: 10894] at term 3533
2020-12-31 10:41:12.934730 I | raft: raft.node: 1631c938b2ee394f elected leader c444e9acc71cc0dd at term 3533
2020-12-31 10:41:45.322685 E | etcdserver: publish error: etcdserver: request timed out, possibly due to connection lost
2020-12-31 10:42:30.322915 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:43:15.323199 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:44:00.323454 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:44:45.323774 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:45:30.324023 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:46:15.324324 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:47:00.324603 E | etcdserver: publish error: etcdserver: request timed out
2020-12-31 10:47:45.324880 E | etcdserver: publish error: etcdserver: request timed out

从日志中可以看到,前面其实etcd2时参与了选举的,后面就有timeout了,看另外两台etcd日志,没有发现什么异常

查看3台etcd上的socket连接

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:0A:00:01:1C
inet addr:10.0.1.28 Bcast:10.0.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:319806 errors:0 dropped:0 overruns:0 frame:0
TX packets:307674 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:31731794 (30.2 MiB) TX bytes:31390009 (29.9 MiB)

eth1 Link encap:Ethernet HWaddr 02:42:AC:12:00:11
inet addr:172.18.0.17 Bcast:172.18.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:303 errors:0 dropped:0 overruns:0 frame:0
TX packets:114 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:23294 (22.7 KiB) TX bytes:7924 (7.7 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:713 errors:0 dropped:0 overruns:0 frame:0
TX packets:713 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:37461 (36.5 KiB) TX bytes:37461 (36.5 KiB)

/ # netstat -anp|grep -v ffff
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:41690 0.0.0.0:* LISTEN -
tcp 0 0 10.0.1.28:41534 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:45136 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:60442 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:45113 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:45112 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:40744 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:60432 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 127.0.0.1:40218 127.0.0.1:2379 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.28:60433 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 :::2379 :::* LISTEN 1/etcd
tcp 0 0 :::2380 :::* LISTEN 1/etcd
udp 0 0 127.0.0.11:55139 0.0.0.0:* -

/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:0A:00:01:1E
inet addr:10.0.1.30 Bcast:10.0.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:426358 errors:0 dropped:0 overruns:0 frame:0
TX packets:413014 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:41363544 (39.4 MiB) TX bytes:40187220 (38.3 MiB)

eth1 Link encap:Ethernet HWaddr 02:42:AC:12:00:0F
inet addr:172.18.0.15 Bcast:172.18.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:656 (656.0 B) TX bytes:0 (0.0 B)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:549 errors:0 dropped:0 overruns:0 frame:0
TX packets:549 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:23927 (23.3 KiB) TX bytes:23927 (23.3 KiB)

/ # netstat -anp|grep -v ffff
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:39909 0.0.0.0:* LISTEN -
tcp 0 0 10.0.1.30:34078 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:34061 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 127.0.0.1:37410 127.0.0.1:2379 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:34060 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:49638 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:34100 10.0.1.21:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:49676 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:49636 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.30:49652 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 :::2379 :::* LISTEN 1/etcd
tcp 0 0 :::2380 :::* LISTEN 1/etcd
udp 0 0 127.0.0.11:38382 0.0.0.0:* -




/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:0A:00:01:1A
inet addr:10.0.1.26 Bcast:10.0.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:430368 errors:0 dropped:0 overruns:0 frame:0
TX packets:423141 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:37496743 (35.7 MiB) TX bytes:36964712 (35.2 MiB)

eth1 Link encap:Ethernet HWaddr 02:42:AC:12:00:04
inet addr:172.18.0.4 Bcast:172.18.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:243 errors:0 dropped:0 overruns:0 frame:0
TX packets:50 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:14918 (14.5 KiB) TX bytes:3444 (3.3 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:145 errors:0 dropped:0 overruns:0 frame:0
TX packets:145 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:10712 (10.4 KiB) TX bytes:10712 (10.4 KiB)

/ #
/ #
/ # netstat -anp|grep -v ffff
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:41731 0.0.0.0:* LISTEN -
tcp 0 0 10.0.1.26:58560 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.26:58534 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.26:60160 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.26:60156 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.26:60154 10.0.1.29:2380 ESTABLISHED 1/etcd
tcp 0 0 10.0.1.26:58536 10.0.1.27:2380 ESTABLISHED 1/etcd
tcp 129 0 :::2379 :::* LISTEN 1/etcd
tcp 0 0 :::2380 :::* LISTEN 1/etcd
udp 0 0 127.0.0.11:41761 0.0.0.0:* -

之前集群机器上出现过时钟不同步的问题,etcd会报如下警告

1
rafthttp: the clock difference against peer e9a110a6ff0ea28f is too high [5.222576339s > 1s]

时钟不同步引起的某节点数据不一致导致的
此时,我们可认为某节点出现故障,采用故障恢复方案